{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using TorchSharp to Generate Synthetic Data for a Regression Problem\n",
    "\n",
    "This tutorial is based on a [PyTorch example](https://jamesmccaffrey.wordpress.com/2023/06/09/using-pytorch-to-generate-synthetic-data-for-a-regression-problem/) posted by James D. McCaffrey on his blog, ported to TorchSharp.\n",
    "\n",
    "Synthetic data sets can be very useful when evaluating and choosing a model.\n",
    "\n",
    "Note that we're taking some shortcuts in this example: rather than writing the data set as a text file that can be loaded from any modeling framework, we're saving the data as serialized TorchSharp tensors. It should be straightforward to modify the tutorial to write the data sets as text instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "#r \"nuget: TorchSharp-cpu\"\n",
    "\n",
    "open TorchSharp\n",
    "open type TorchSharp.TensorExtensionMethods"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Generative Network\n",
    "\n",
    "Neural networks can be used to generate data as well as to learn from it. The synthetic data can then be used to evaluate different models to see how well they can reproduce the behavior of the network used to produce the data.\n",
    "\n",
    "First, we will create the model that will be used to generate the synthetic data. Later, we'll construct a second model that will be trained on the data the first model generates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "type Net(n_in : int) as this = \n",
    "    inherit torch.nn.Module<torch.Tensor,torch.Tensor>(\"Net\")\n",
    "\n",
    "    let hid1 = torch.nn.Linear(n_in, 10)\n",
    "    let oupt = torch.nn.Linear(10, 1)\n",
    "\n",
    "    do\n",
    "        let lim = 0.80;\n",
    "        torch.nn.init.uniform_(hid1.weight, -lim, lim) |> ignore\n",
    "        torch.nn.init.uniform_(hid1.bias, -lim, lim) |> ignore\n",
    "        torch.nn.init.uniform_(oupt.weight, -lim, lim) |> ignore\n",
    "        torch.nn.init.uniform_(oupt.bias, -lim, lim) |> ignore\n",
    "        \n",
    "        this.RegisterComponents()\n",
    "\n",
    "    override _.forward(input) = \n",
    "        use _ = torch.NewDisposeScope()\n",
    "        let z = hid1.call(input).tanh_()\n",
    "        let x = oupt.call(z).sigmoid_()\n",
    "        x.MoveToOuterDisposeScope()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have our generative network, we can define the function that creates the data set. If you compare this with the PyTorch code, you will notice that we're relying on TorchSharp to generate a whole batch of data at once, rather than looping. We're also using TorchSharp instead of NumPy for the noise generation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "let create_data_file(net: Net, n_in: int64, fileName: string, n_items: int64) =\n",
    "    let x_lo = -1.0\n",
    "    let x_hi = 1.0\n",
    "\n",
    "    let one_hundredth = 0.01.ToScalar()\n",
    "\n",
    "    let X = (x_hi - x_lo).ToScalar() * torch.rand([|n_items; n_in|]) + x_lo.ToScalar()\n",
    "\n",
    "    use d = torch.no_grad()\n",
    "\n",
    "    let mutable y = net.call(X)\n",
    "\n",
    "    y <- y + torch.rand(y.shape) * one_hundredth\n",
    "\n",
    "    y <- torch.where(y.le(torch.tensor(0.0)), y + one_hundredth * torch.randn(y.shape) + one_hundredth, y)\n",
    "\n",
    "    X.save(fileName + \".x\")\n",
    "    y.save(fileName + \".y\")\n",
    "\n",
    "let load_data_file(fileName: string) = (torch.Tensor.load(fileName + \".x\"), torch.Tensor.load(fileName + \".y\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "let net = new Net(6)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the data files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "create_data_file(net, 6, \"train.dat\", 2000);\n",
    "create_data_file(net, 6, \"test.dat\", 400);"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Using the Data\n",
    "\n",
    "Load the data from files again. This is just to demonstrate how to get the data from disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "let X_train,y_train = load_data_file(\"train.dat\")\n",
    "let X_test, y_test =  load_data_file(\"test.dat\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create another model class with a slightly different architecture (a smaller hidden layer, ReLU instead of tanh, and default weight initialization), and train it on the generated data set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "type Net2(n_in : int) as this = \n",
    "    inherit torch.nn.Module<torch.Tensor,torch.Tensor>(\"Net2\")\n",
    "\n",
    "    let hid1 = torch.nn.Linear(n_in, 5)\n",
    "    let oupt = torch.nn.Linear(5, 1)\n",
    "\n",
    "    do\n",
    "        this.RegisterComponents()\n",
    "\n",
    "    override _.forward(input) = \n",
    "        use _ = torch.NewDisposeScope()\n",
    "        let z = hid1.call(input).relu_()\n",
    "        let x = oupt.call(z).sigmoid_()\n",
    "        x.MoveToOuterDisposeScope()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create an instance of the second network and choose a loss function. You also need an optimizer and, optionally, a learning rate scheduler. Then you're ready to train."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "let model = new Net2(6)\n",
    "\n",
    "let loss = torch.nn.MSELoss()\n",
    "\n",
    "let learning_rate = 0.01\n",
    "let optimizer = torch.optim.Rprop(model.parameters(), learning_rate)\n",
    "let scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a fairly standard training loop. The entire training set is passed through the model as a single batch. The cell ends by evaluating the trained model on the training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "printf \" initial loss = %s\\n\" (loss.forward(model.forward(X_train), y_train).item<float32>().ToString())\n",
    "\n",
    "for epoch = 1 to 1000 do\n",
    "\n",
    "    let output = loss.forward(model.forward(X_train), y_train)\n",
    "    \n",
    "    // Clear the gradients before doing the back-propagation\n",
    "    model.zero_grad()\n",
    "\n",
    "    // Do back-propagation, which computes all the gradients.\n",
    "    output.backward()\n",
    "\n",
    "    optimizer.step() |> ignore\n",
    "\n",
    "    if epoch % 100 = 99 then\n",
    "        scheduler.step()\n",
    "\n",
    "printf \" final loss   = %s\\n\" (loss.forward(model.forward(X_train), y_train).item<float32>().ToString())\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What we're really curious about is how the second model does on the test set, which it didn't see during training. If the test loss is significantly greater than the training loss, the model is not generalizing well. If the model has not yet converged, training for more epochs may help. If the test loss still doesn't get closer to the training loss with more epochs, we may need more data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "loss.forward(model.forward(X_test), y_test).item<float32>()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Splitting the Data into Batches\n",
    "\n",
    "If we want to be a little bit more advanced, we can split the training set into batches. Here, we split it into 10 batches of equal size."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "let N = X_train.shape[0]/10L\n",
    "let X_batch = X_train.split(N)\n",
    "let y_batch = y_train.split(N)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That means modifying the training loop, too. Each epoch now takes longer because it runs one optimizer step per batch, but the model may converge in fewer epochs, so the total time before you have the desired model may still be shorter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "printf \" initial loss = %s\\n\" (loss.forward(model.forward(X_train), y_train).item<float32>().ToString())\n",
    "\n",
    "for epoch = 1 to 1000 do\n",
    "\n",
    "    for j = 0 to X_batch.Length-1 do\n",
    "\n",
    "        let output = loss.forward(model.forward(X_batch[j]), y_batch[j])\n",
    "        \n",
    "        // Clear the gradients before doing the back-propagation\n",
    "        model.zero_grad()\n",
    "\n",
    "        // Do back-propagation, which computes all the gradients.\n",
    "        output.backward()\n",
    "\n",
    "        optimizer.step() |> ignore\n",
    "\n",
    "    scheduler.step()\n",
    "\n",
    "printf \" final loss   = %s\\n\" (loss.forward(model.forward(X_train), y_train).item<float32>().ToString())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "loss.forward(model.forward(X_test), y_test).item<float32>()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Dataset and DataLoader\n",
    "\n",
    "If we wanted to be really advanced, we would use TorchSharp data sets and data loaders, which would allow us to shuffle the training data set between epochs (at the end of the outer training loop). Here's how we'd do that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "type SyntheticDataset(fileName: string) as this = \n",
    "    inherit torch.utils.data.Dataset()\n",
    "\n",
    "    // Load the features and labels saved by create_data_file.\n",
    "    let _data:torch.Tensor = torch.Tensor.load(fileName + \".x\")\n",
    "    let _labels:torch.Tensor = torch.Tensor.load(fileName + \".y\")\n",
    "\n",
    "    \n",
    "    override _.GetTensor(index: int64) =\n",
    "        let rdic = new System.Collections.Generic.Dictionary<string, torch.Tensor>()\n",
    "        rdic.Add(\"data\", _data[index])\n",
    "        rdic.Add(\"label\", _labels[index])\n",
    "        rdic\n",
    "\n",
    "    override _.Count = _data.shape[0]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training loop gets slightly more complex with the data set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "let training_data = new SyntheticDataset(\"train.dat\")\n",
    "let train = new torch.utils.data.DataLoader(training_data, 200, shuffle=true);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "printf \" initial loss = %s\\n\" (loss.forward(model.forward(X_train), y_train).item<float32>().ToString())\n",
    "\n",
    "for epoch = 1 to 1000 do\n",
    "\n",
    "    for data in train do\n",
    "\n",
    "        let output = loss.forward(model.forward(data[\"data\"]), data[\"label\"])\n",
    "        \n",
    "        // Clear the gradients before doing the back-propagation\n",
    "        model.zero_grad()\n",
    "\n",
    "        // Do back-propagation, which computes all the gradients.\n",
    "        output.backward()\n",
    "\n",
    "        optimizer.step() |> ignore\n",
    "\n",
    "    scheduler.step()\n",
    "\n",
    "printf \" final loss   = %s\\n\" (loss.forward(model.forward(X_train), y_train).item<float32>().ToString())"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's slower, and the convergence isn't that much better, but that will depend on the model used. You just have to experiment with different approaches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "loss.forward(model.forward(X_test), y_test).item<float32>()"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
